Polycot-data-from-NY.csvReading in the data (had to remove manually the ‘#’ sign from the column and save as csv file):
dat = read.table("data/Polycot-data-from-NY.csv", sep=",",header=TRUE)
str(dat)
## 'data.frame': 803 obs. of 4 variables:
## $ plant.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cotyledons: int 2 2 2 2 2 2 2 2 2 2 ...
## $ generation: Factor w/ 2 levels "O","p": 2 2 2 2 2 2 2 2 2 2 ...
## $ parents : Factor w/ 11 levels "","2x2","2x3",..: 1 1 1 1 1 1 1 1 1 1 ...
The data has 4 columns:
head(dat)
## plant.ID cotyledons generation parents
## 1 1 2 p
## 2 2 2 p
## 3 3 2 p
## 4 4 2 p
## 5 5 2 p
## 6 6 2 p
dat[200:210,]
## plant.ID cotyledons generation parents
## 200 200 2 O 2x2
## 201 201 2 O 2x2
## 202 202 2 O 2x2
## 203 203 2 O 2x2
## 204 204 2 O 2x2
## 205 205 2 O 2x2
## 206 206 2 O 2x2
## 207 207 2 O 2x2
## 208 208 2 O 2x2
## 209 209 2 O 2x2
## 210 210 2 O 2x2
We can summarize basic counts. For example, we have 622 offspring plants and 181 parent plants. On average, the plants have 2.691 cotelydons (mean=2.691). The median being 3 cotelydons means that 50% of the plants have 3 cotelydons or fewer
summary(dat)
## plant.ID cotyledons generation parents
## Min. : 1.0 Min. :2.000 O:622 :181
## 1st Qu.:201.5 1st Qu.:2.000 p:181 2x2 :152
## Median :402.0 Median :3.000 2x3 :139
## Mean :399.2 Mean :2.691 3x3 :109
## 3rd Qu.:602.5 3rd Qu.:3.000 4x4 : 96
## Max. :783.0 Max. :5.000 3x4 : 87
## (Other): 39
summary(dat$parents)
## 2x2 2x3 2x4 3x3 3x4 4x4 4x5 4x6 4x7 4x8
## 181 152 139 35 109 87 96 1 1 1 1
Now, we can combine groups in our data:
dat2 = within(dat, group <- "(2+)x(2+)")
dat2$group[dat2$parents == ""] = "parent"
dat2$group[dat2$parents == "2x2"] = "2x2"
dat2$group[dat2$parents == "2x3"] = "2x(2+)"
dat2$group[dat2$parents == "2x4"] = "2x(2+)"
dat2$group = factor(dat2$group,levels=c("parent", "2x2", "2x(2+)", "(2+)x(2+)"))
Standard t-test comparing:
Assumptions:
Plotting the densities:
Not really normal!
summary(dat2$group)
## parent 2x2 2x(2+) (2+)x(2+)
## 181 152 174 296
x = dat2$cotyledons[dat2$group == "2x2"]
y = dat2$cotyledons[dat2$group == "(2+)x(2+)"]
Variances look similar:
var(x)
## [1] 0.3275967
var(y)
## [1] 0.4926134
But let’s do a Levene test:
leveneTest(cotyledons ~ group, dat3, center=mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 0.2634 0.6081
## 446
The p-value is greater than 0.05, so we do not reject the null hypothesis of equal variances.
The Levene test also assumes Normality. So, we can run the alternative Fligner-Killeen test with the same conclusion (equal variances):
fligner.test(cotyledons ~ group, dat3)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: cotyledons by group
## Fligner-Killeen:med chi-squared = 0.33916, df = 1, p-value =
## 0.5603
Back to the t-test assumptions:
If we ignore the violation of normality and we run a standard t-test, we get:
t.test(x,y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = -6.8072, df = 363.34, p-value = 4.126e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5422314 -0.2991627
## sample estimates:
## mean of x mean of y
## 2.440789 2.861486
The p-value is less than 0.05, so we reject the null hypothesis of equality of means. This implies that we have evidence that the number of cotyledons is significantly larger in the offspring from “(2+)x(2+)” than from “2x2”.
Our justification to do a standard t-test is that if the sample size is moderately large, the two-sample t-test is robust to non-normality due to the central limit theorem.
Alternative to the t-test when normality is violated. This test does not test equality of means, but equality of distributions, so the conclusion could be deceiving (that is, two different distributions with the same mean will have a rejected null hypothesis using WMW test).
wilcox.test(cotyledons ~ group, dat3)
##
## Wilcoxon rank sum test with continuity correction
##
## data: cotyledons by group
## W = 15216, p-value = 7.461e-10
## alternative hypothesis: true location shift is not equal to 0
The p-value less than 0.05 allows us to also reject the null hypothesis of equality of distributions which provides evidence that the number of cotyledons on the “(2+)x(2+)” group is significantly higher than in the “2x2” group.
If we want to compare more than two groups, we can do so with a one-way ANOVA test. We will compare the number of cotyledons on the four groups:
Assumptions:
res.aov <- aov(cotyledons ~ group, data = dat2)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 3 18.7 6.248 13.62 1.14e-08 ***
## Residuals 799 366.7 0.459
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject that all groups have the same mean number of cotyledons.
We can check the assumptions:
# Homogeneity of variance
plot(res.aov,1)
# Normality
plot(res.aov,2)
When normality is not met, we can use a Kruskal-Wallis rank sum test, with the same conclusion:
kruskal.test(cotyledons ~ group, data = dat2)
##
## Kruskal-Wallis rank sum test
##
## data: cotyledons by group
## Kruskal-Wallis chi-squared = 39.264, df = 3, p-value = 1.526e-08
We can move on to do Tukey multiple pairwise-comparisons (HSD test):
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cotyledons ~ group, data = dat2)
##
## $group
## diff lwr upr p adj
## 2x2-parent -0.2387685 -0.430637375 -0.0468997 0.0076856
## 2x(2+)-parent -0.0473741 -0.232532612 0.1377844 0.9124834
## (2+)x(2+)-parent 0.1819285 0.017371061 0.3464859 0.0234548
## 2x(2+)-2x2 0.1913944 -0.002228051 0.3850169 0.0540364
## (2+)x(2+)-2x2 0.4206970 0.246670761 0.5947233 0.0000000
## (2+)x(2+)-2x(2+) 0.2293026 0.062703782 0.3959014 0.0023572
Polycot-data-2020.csvReading in the data (had to remove manually the ‘#’ sign from the column and save as csv file):
dat = read.table("data/Polycot-data-2020.csv", sep=",",header=TRUE)
str(dat)
## 'data.frame': 5585 obs. of 2 variables:
## $ Cotyledons: int 2 2 2 3 2 2 2 2 3 2 ...
## $ Generation: Factor w/ 2 levels "O","P": 2 2 2 2 2 2 2 2 2 2 ...
head(dat)
## Cotyledons Generation
## 1 2 P
## 2 2 P
## 3 2 P
## 4 3 P
## 5 2 P
## 6 2 P
summary(dat)
## Cotyledons Generation
## Min. :2.000 O:3029
## 1st Qu.:2.000 P:2556
## Median :3.000
## Mean :2.595
## 3rd Qu.:3.000
## Max. :4.000
dat = within(dat, Generation<-factor(Generation, levels=c("P","O")))
Standard t-test comparing:
Assumptions:
Plotting the densities:
Not really normal!
x = dat$Cotyledons[dat$Generation == "P"]
y = dat$Cotyledons[dat$Generation == "O"]
Variances look similar:
var(x)
## [1] 0.2418511
var(y)
## [1] 0.2789809
But let’s do a Levene test:
leveneTest(Cotyledons ~ Generation, dat, center=mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 45.032 2.13e-11 ***
## 5583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.05, so we reject the null hypothesis of equal variances.
The Levene test also assumes Normality. So, we can run the alternative Fligner-Killeen test with the same conclusion (equal variances):
fligner.test(Cotyledons ~ Generation, dat)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Cotyledons by Generation
## Fligner-Killeen:med chi-squared = 15.035, df = 1, p-value =
## 0.0001055
Back to the t-test assumptions:
If we ignore the violations and we run a standard t-test, we get:
t.test(x,y, var.equal=TRUE)
##
## Two Sample t-test
##
## data: x and y
## t = -29.717, df = 5583, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4354871 -0.3815863
## sample estimates:
## mean of x mean of y
## 2.373239 2.781776
The p-value is less than 0.05, so we reject the null hypothesis of equality of means. This implies that we have evidence that the number of cotyledons is significantly larger in the offspring than in the parents.
Our justification to do a standard t-test is that if the sample size is moderately large, the two-sample t-test is robust to non-normality due to the central limit theorem.
We can also run the Welch version of t test that does not assume equal variances (this is the default in R):
t.test(x,y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = -29.897, df = 5529.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4353249 -0.3817485
## sample estimates:
## mean of x mean of y
## 2.373239 2.781776
Alternative to the t-test when normality is violated. This test does not test equality of means, but equality of distributions, so the conclusion could be deceiving (that is, two different distributions with the same mean will have a rejected null hypothesis using WMW test).
wilcox.test(Cotyledons ~ Generation, dat)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Cotyledons by Generation
## W = 2417650, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The p-value less than 0.05 allows us to also reject the null hypothesis of equality of distributions which provides evidence that the number of cotyledons on the offspring group is significantly higher than in the parent group.
Polycot-data-2020.csv focusing on chi-square testReading in the data (had to remove manually the ‘#’ sign from the column and save as csv file):
dat = read.table("data/Polycot-data-2020.csv", sep=",",header=TRUE)
str(dat)
## 'data.frame': 5585 obs. of 2 variables:
## $ Cotyledons: int 2 2 2 3 2 2 2 2 3 2 ...
## $ Generation: Factor w/ 2 levels "O","P": 2 2 2 2 2 2 2 2 2 2 ...
head(dat)
## Cotyledons Generation
## 1 2 P
## 2 2 P
## 3 2 P
## 4 3 P
## 5 2 P
## 6 2 P
summary(dat)
## Cotyledons Generation
## Min. :2.000 O:3029
## 1st Qu.:2.000 P:2556
## Median :3.000
## Mean :2.595
## 3rd Qu.:3.000
## Max. :4.000
dat = within(dat, Generation<-factor(Generation, levels=c("P","O")))
Creating categorical variable: dycots/polycots
dat2 = data.frame(generation=dat$Generation)
dat2$cots = "dycot"
dat2$cots[dat$Cotyledons > 2] = "polycot"
dat2 = within(dat2, cots<-factor(cots))
summary(dat2)
## generation cots
## P:2556 dycot :2437
## O:3029 polycot:3148
write.table(dat2,file="data/polycot-data-2020-categories.csv", row.names = FALSE, sep=",")
dt = table(dat2$cots, dat2$generation)
library(graphics)
mosaicplot(dt, shade = TRUE, las=2, main="Cotelydons")
dt
##
## P O
## dycot 1612 825
## polycot 944 2204
ct = chisq.test(dt)
obs = ct$observed
exp = ct$expected
df = rbind(obs[1,],exp[1,],obs[2,],exp[2,],colSums(obs))
rownames(df) = c("dycot observed", "dycot expected", "polycot observed", "polycot expected", "total")
r = rowSums(obs)
ss = sum(r)
totals = c(r[1],NA,r[2],NA,ss)
cbind(df,totals)
## P O totals
## dycot observed 1612.000 825.000 2437
## dycot expected 1115.304 1321.696 NA
## polycot observed 944.000 2204.000 3148
## polycot expected 1440.696 1707.304 NA
## total 2556.000 3029.000 5585
ct$statistic
## X-squared
## 722.1475
ct$p.value
## [1] 4.567529e-159
## expected values
2437*2556/5585
## [1] 1115.304
2437*3029/5585
## [1] 1321.696
3148*2556/5585
## [1] 1440.696
3148*3029/5585
## [1] 1707.304